A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora

Author

  • Pascale Fung
Abstract

We present a pattern matching method for compiling a bilingual lexicon of nouns and proper nouns from unaligned, noisy parallel texts of Asian/Indo-European language pairs. Tagging information of one language is used. Word frequency and position information for high and low frequency words are represented in two different vector forms for pattern matching. New anchor point finding and noise elimination techniques are introduced. We obtained 73.1% precision. We also show how the results can be used in the compilation of domain-specific noun phrases.

1 Bilingual lexicon compilation without sentence alignment

Automatically compiling a bilingual lexicon of nouns and proper nouns can contribute significantly to breaking the bottleneck in machine translation and machine-aided translation systems. Domain-specific terms are hard to translate because they often do not appear in dictionaries. Since most of these terms are nouns, proper nouns or noun phrases, compiling a bilingual lexicon of these word groups is an important first step.

We have been studying robust lexicon compilation methods which do not rely on sentence alignment. Existing lexicon compilation methods (Kupiec 1993; Smadja & McKeown 1994; Kumano & Hirakawa 1994; Dagan et al. 1993; Wu & Xia 1994) all attempt to extract pairs of words or compounds that are translations of each other from previously sentence-aligned, parallel texts. However, sentence alignment (Brown et al. 1991; Kay & Röscheisen 1993; Gale & Church 1993; Church 1993; Chen 1993; Wu 1994) is not always practical when corpora have unclear sentence boundaries or contain noisy text segments present in only one language.

Our proposed algorithm for bilingual lexicon acquisition bootstraps off of corpus alignment procedures we developed earlier (Fung & Church 1994; Fung & McKeown 1994).
Those procedures attempted to align texts by finding matching word pairs and have demonstrated their effectiveness for Chinese/English and Japanese/English. The main focus then was accurate alignment, but the procedure produced a small number of word translations as a by-product. In contrast, our new algorithm performs a minimal alignment, to facilitate compiling a much larger bilingual lexicon.

The paradigm of Fung & Church (1994) and Fung & McKeown (1994) is based on two main steps: find a small bilingual primary lexicon, use the text segments which contain some of the word pairs in the lexicon as anchor points for alignment, align the text, and compute a better secondary lexicon from these partially aligned texts. This paradigm can be seen as analogous to the Estimation-Maximization step in Brown et al. (1991); Dagan et al. (1993); Wu & Xia (1994).

For a noisy corpus without sentence boundaries, the primary lexicon accuracy depends on the robustness of the algorithm for finding word translations given no a priori information. The reliability of the anchor points will determine the accuracy of the secondary lexicon. We also want an algorithm that bypasses a long, tedious sentence or text alignment step.

2 Algorithm overview

We treat the bilingual lexicon compilation problem as a pattern matching problem: each word shares some common features with its counterpart in the translated text. We try to find the best representations of these features and the best ways to match them. We ran the algorithm on a small Chinese/English parallel corpus of approximately 5760 unique English words. The outline of the algorithm is as follows:

1. Tag the English half of the parallel text. In the first stage of the algorithm, only English words which are tagged as nouns or proper nouns are used to match words in the Chinese text.
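As a rough illustration of this first step, the sketch below keeps only the English tokens tagged as nouns or proper nouns and records where each one occurs, which gives both a frequency and a list of positions per candidate word. It is a minimal sketch under stated assumptions, not the paper's implementation: the Penn Treebank-style tag names, the function name noun_positions, and the toy sentence are all hypothetical, and the excerpt does not say which tagger or which exact vector forms were used.

```python
from collections import defaultdict

# Assumption: Penn Treebank-style noun tags; the source does not name a tagset.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_positions(tagged_tokens):
    """Map each noun or proper noun to the token offsets where it occurs.

    tagged_tokens is a list of (word, tag) pairs, assumed to come from
    some part-of-speech tagger run on the English half of the corpus.
    """
    positions = defaultdict(list)
    for offset, (word, tag) in enumerate(tagged_tokens):
        if tag in NOUN_TAGS:
            positions[word].append(offset)
    return dict(positions)


# Hypothetical toy input, already tagged.
tagged = [
    ("The", "DT"), ("Governor", "NNP"), ("addressed", "VBD"),
    ("the", "DT"), ("council", "NN"), (".", "."),
    ("The", "DT"), ("council", "NN"), ("adjourned", "VBD"), (".", "."),
]

candidates = noun_positions(tagged)
# Frequency of a candidate is the number of recorded positions.
frequencies = {word: len(offsets) for word, offsets in candidates.items()}
```

Here candidates maps "Governor" to [1] and "council" to [4, 7]. Position lists of this kind are one plausible starting point for the frequency and position features mentioned in the abstract; how they are turned into vectors for matching against the Chinese text is not described in this excerpt.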

Publication date: 1995